I began applying for Information Technology jobs on August 11th, 2020. I used LinkedIn as my primary source of information. I varied between large and small companies, EasyApply and normal applications via company job application portals or website submissions, and various locations across the US. I even applied to a job in Sydney, Australia (by accident) and a job in Toronto, Canada.
I saved my application information in an Excel spreadsheet as I applied, and decided to use it to practice some visualization techniques.
#General Packages.
from datetime import date
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
#Specific Packages
from wordcloud import WordCloud
import bar_chart_race as bcr
from IPython.display import Video
import plotly.graph_objects as go
from pywaffle import Waffle
#Read data.
df = pd.read_excel(r'D:\Data Science\To Work On\Jobs\Jobs.xlsx')
lat_lons = pd.read_excel(r'D:\Data Science\To Work On\Jobs\uscities.xlsx')
#Convert datetime columns to date.
df['Date_Applied'] = df['Date_Applied'].apply(lambda x: x.date())
df['Rejection_Email'] = df['Rejection_Email'].apply(lambda x: x.date())
df['Viewed_Email'] = df['Viewed_Email'].apply(lambda x: x.date())
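As a side note, the per-cell lambda works, but pandas' `.dt.date` accessor does the same conversion vectorized and passes missing values (`NaT`) through unchanged, which matters for columns that can have blanks (e.g. no rejection email yet). A minimal sketch on toy data:

```python
import pandas as pd

# Toy column with one real timestamp and one missing value.
toy = pd.DataFrame({'Rejection_Email': pd.to_datetime(['2020-08-15', None])})

# Vectorized conversion; NaT stays NaT instead of raising.
toy['Rejection_Email'] = toy['Rejection_Email'].dt.date
print(toy['Rejection_Email'].tolist())
```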
#Current Numbers
print("As of " + date.today().strftime("%A, %B %d, %Y") + ", I have applied to " + str(df.shape[0]) + " jobs.")
This section looks at the specific job titles.
Below are the top 10 and bottom 10 words from the job titles (several of the latter occur only once).
#Combine all job titles into one string.
jobs_string = ' '.join(df['Title'])
#Only letters are useful.
regex = re.compile('[^a-zA-Z]')
#Remove all non-letters, and remove ' I ', ' II ', ' III '.
jobs_string = regex.sub(' ', jobs_string).replace(' I ',' ').replace(' II ',' ').replace(' III ',' ')
#Specific Replacement.
jobs_string = jobs_string.replace('AR VR','AR/VR').replace('C C', 'C2C').replace('Microsoft', 'Microsoft-365').replace('Non IT', 'Non-IT')
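One caveat with the chained `.replace(' I ', ' ')` calls: they require a space on both sides, so a numeral at the very end of the string slips through. A word-boundary regex is a more robust alternative; here is a sketch on a hypothetical sample title:

```python
import re

# Sample string; the trailing 'I' has no space after it, so ' I ' would miss it.
title = 'Systems Engineer II Data Analyst I'

# Strip standalone roman numerals I, II, III wherever they appear.
cleaned = re.sub(r'\b(?:I|II|III)\b', '', title)
cleaned = ' '.join(cleaned.split())  # collapse the doubled spaces left behind
print(cleaned)  # Systems Engineer Data Analyst
```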
#Create a frequency distribution of all words in the job titles.
jobs_dict = {}
jobs_words = jobs_string.split()
for word in jobs_words :
    if word not in jobs_dict :
        jobs_dict[word] = 1
    else :
        jobs_dict[word] += 1
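The standard library's `collections.Counter` builds the same frequency distribution in one line, and `most_common` gives the sorted pairs directly. A small sketch on toy words:

```python
from collections import Counter

# Toy word list standing in for jobs_words.
words = 'Data Engineer Data Analyst Data Scientist'.split()

freq = Counter(words)              # word -> count, same as the loop above
print(freq.most_common(2))        # [('Data', 3), ('Engineer', 1)]
```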
#Convert frequency distribution to dataframe, sort by frequency.
jobs_df = pd.DataFrame({'Word' : list(jobs_dict.keys()),
'Count' : list(jobs_dict.values())}).sort_values(by = 'Count', ascending = False, axis = 0).reset_index(drop = True)
jobs_df.head(10)
jobs_df.tail(10)
I thought a word cloud would be a fun way to visualize the job titles. The proportional sizing has been rescaled.
jobs_wc = WordCloud(background_color = 'white',
max_words = 300,
collocations = False,
relative_scaling = 0)
jobs_wc.generate(jobs_string)
plt.figure(figsize = (15, 11))
plt.imshow(jobs_wc, interpolation = 'bilinear')
plt.axis('off');
Some companies are hiring heavily. Some are recruiting and staffing agencies for others.
Once my application was in a system on a particular company's career portal page, it was easy to reapply. I used this quite a bit for companies like Amazon, Google, MITRE, and PayPal.
I used LinkedIn's EasyApply for many applications as well.
For the others, I sometimes applied as a guest, sometimes only had to upload my resume and cover letter, and sometimes had to go through a 20-minute ordeal for a single opening. It varied. ¯_( ͡° ͜ʖ ͡°)_/¯
Below is a cumulative sum of applications per company and date, sorted alphabetically and chronologically, respectively.
#List of all companies.
companies = df['Company'].unique()
#Dates from first application to today.
date_index = pd.date_range(start = min(df['Date_Applied']), end = date.today())
#Create new data frame of 0s.
application_df = pd.DataFrame(index = date_index, columns = companies).fillna(0)
#Create cumulative count of job applications by company and date.
for i in range(len(df)) :
    company = df.iloc[i, 1]
    date_app = df.iloc[i, 7]
    application_df.loc[date_app:, company] += 1
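The row-by-row loop can also be done vectorized: `pd.crosstab` gives one row per date and one column per company, `reindex` fills in the dates with no applications, and `cumsum` accumulates down the rows. A sketch with made-up companies and dates:

```python
import pandas as pd

# Toy applications: two companies over two days.
toy = pd.DataFrame({'Company': ['Acme', 'Beta', 'Acme'],
                    'Date_Applied': pd.to_datetime(['2020-08-11', '2020-08-11', '2020-08-12'])})

# Daily counts per company, padded to a full date range, then accumulated.
daily = pd.crosstab(toy['Date_Applied'], toy['Company'])
cum = (daily.reindex(pd.date_range('2020-08-11', '2020-08-13'), fill_value = 0)
            .cumsum())
print(cum['Acme'].tolist())  # [1, 2, 2]
```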
#Alphabetical
application_df = application_df.reindex(sorted(application_df.columns), axis=1)
pd.set_option('display.max_columns', None)
display(application_df)
#Total number of applications to each company.
cumulative_app_count = application_df.iloc[[-1]]
major_companies = cumulative_app_count.columns[(cumulative_app_count >= 5).iloc[0]]
minor_companies = cumulative_app_count.columns[(cumulative_app_count < 5).iloc[0]].tolist()
#Create a dummy company called 'Other', containing all companies with fewer than 5 applications.
major_df = pd.DataFrame.copy(application_df)
major_df['Other'] = major_df[minor_companies].sum(axis = 1)
major_df.drop(minor_companies, axis = 1, inplace = True)
#major_df.shape
Below is an animation of the above data. Bar chart races run more smoothly with larger numbers (like population or monetary amounts) over longer periods of time, but I am happy with the way this turned out.
bold_colors = ['#f0f0f0', '#3cb44b', '#e6194b', '#fffac8', '#9a6324', '#e6beff', '#fabebe', '#000075',
'#ffe119', '#008080', '#ffffff', '#4363d8', '#bcf60c', '#f58231', '#911eb4',
'#800000', '#aaffc3', '#808000', '#ffd8b1', '#46f0f0', '#f032e6', '#000000']
apps = bcr.bar_chart_race(df = major_df,
filename = 'applications.mp4',
orientation = 'h',
sort = 'desc',
n_bars = 15,
cmap = bold_colors[2:22],
filter_column_colors = False,
period_fmt = '%B %d, %Y',
period_label = {'x': 0.99,
'y': 0.15,
'ha': 'right',
'size': 14},
period_summary_func = lambda v, r: {'x': 0.99,
'y': 0.05,
's': f"{v.sum():,.0f} applications completed.\nCompanies with fewer than 5 applications are grouped in 'Other'.",
'ha': 'right',
'size': 8,
'weight': 'normal'},
title = 'Total Number of Jobs Applied to by Company and Date',
steps_per_period = 10)
Video("applications.mp4")
Some slight modifications were made to the LinkedIn data during the application process.
In addition, for Common App, the job location was changed from unspecified to Arlington; I did not learn about their opening from LinkedIn.
Several jobs were advertised with no city, only remote.
df[df['City'] == "Remote"]
I retrieved their office locations and manually entered them.
#Manual City, State_abbv, and State entry.
dict_cs = {193: ("Knoxville", "TN", "Tennessee"),
417: ("San Francisco", "CA", "California"),
418: ("New York", "NY", "New York")}
#Loop to manually enter missing City, State_abbv, and State entry for Remote locations.
for idx in dict_cs :
    temp_city, temp_state_abbv, temp_state = dict_cs[idx]
    df.at[idx, 'City'] = temp_city
    df.at[idx, 'State_abbv'] = temp_state_abbv
    df.at[idx, 'State'] = temp_state
Several jobs I applied to were in cities not in the spreadsheet I downloaded.
#Select relevant columns.
df_loc = df[['City', 'State_abbv', 'State', 'Date_Applied']]
#Select only city, state, and location columns.
lat_lons = lat_lons[['city', 'state_id', 'lat', 'lng']]
#Count the number of applications.
city_tally = df_loc.groupby(['City', 'State_abbv', 'State']).count().reset_index()
#Merge to get latitude, longitude for each city.
merged = pd.merge(city_tally,
lat_lons,
how = 'left',
left_on = ['City', 'State_abbv'],
right_on = ['city', 'state_id'])
#Several cities are not in the list.
merged[merged['city'].isna()]
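Checking `isna()` on a right-side column works; `pd.merge` also has an `indicator=True` flag that adds a `_merge` column, which surfaces unmatched rows directly without relying on a nullable column. A toy sketch with a made-up city:

```python
import pandas as pd

# Toy application tally and lat/lon lookup; 'Atlantis' has no match.
left = pd.DataFrame({'City': ['Boston', 'Atlantis'], 'State_abbv': ['MA', 'XX']})
right = pd.DataFrame({'city': ['Boston'], 'state_id': ['MA'], 'lat': [42.36]})

checked = pd.merge(left, right, how = 'left',
                   left_on = ['City', 'State_abbv'],
                   right_on = ['city', 'state_id'],
                   indicator = True)

# Rows present only on the left side are the ones needing manual entry.
missing = checked[checked['_merge'] == 'left_only']
print(missing['City'].tolist())  # ['Atlantis']
```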
Two were from Australia and Canada. For the others, I retrieved values from Google. An approximate average value was chosen for Dallas-Ft. Worth.
#Huxley job from Sydney, Australia was removed.
merged = merged[merged.State_abbv != 'AU']
#Prodigy Academy job from Toronto, Canada was removed.
merged = merged[merged.State_abbv != 'CN']
#Manual latitude and longitude entry.
dict_loc = {0: (42.6583, -71.1368),
8: (42.4906, -71.2760),
22: (40.5294, -75.3937),
34: (32.7598, -97.0646),
72: (38.2773, -76.4229),
112: (42.3668, -71.3031)}
#Loop to manually enter missing latitude and longitudes.
for idx in dict_loc :
    lat, lon = dict_loc[idx]
    merged.at[idx, 'lat'] = lat
    merged.at[idx, 'lng'] = lon
#Rename columns and drop redundant columns
merged = merged.rename(columns = {'Date_Applied' : 'Count',
'lat': 'Latitude',
'lng': 'Longitude'}).drop(['city', 'state_id'], axis = 1)
After my left join, the totals were off by 11. I soon determined that two cities in California are named Mountain View.
merged[merged.City == 'Mountain View']
I removed the one not in the Bay Area. All cities, with their state, number of applications, and location, are displayed below.
merged = merged[(merged['Latitude'] != 38.0093) | (merged['Longitude'] != -122.1169)]
merged
Below is the number of applications by state.
#Group applications by state, sum the counts, and sort descending.
state_tally = merged[['Count', 'State_abbv', 'State']].groupby(['State_abbv', 'State']).sum().sort_values(by = 'Count', ascending = False, axis = 0).reset_index()
state_tally
Below is an interactive map of the US, showing applications by city and by state.
#Singular or Plural.
def f(row) :
    if row['Count'] == 1 :
        string_val = ' application in '
    else :
        string_val = ' applications in '
    return string_val
#Number of applications per city as text.
merged['Text'] = merged['Count'].astype(str) + merged.apply(f, axis = 1) + merged['City'] + ', ' + merged['State_abbv'] + '.'
#Number of applications per state as text.
state_tally['Text'] = state_tally['Count'].astype(str) + state_tally.apply(f, axis = 1) + state_tally['State'] + '.'
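The row-wise `apply` with a helper function works; `np.where` does the singular/plural choice vectorized in one expression. A sketch on toy rows:

```python
import numpy as np
import pandas as pd

# Toy city tally.
toy = pd.DataFrame({'Count': [1, 3], 'City': ['Reston', 'Austin']})

# Pick the singular or plural label per row, then build the hover text.
label = np.where(toy['Count'] == 1, ' application in ', ' applications in ')
toy['Text'] = toy['Count'].astype(str) + label + toy['City'] + '.'
print(toy['Text'].tolist())
```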
#Color in states by number of applications.
state_map_data = go.Choropleth(locations = state_tally['State_abbv'],
z = state_tally['Count'],
text = state_tally['Text'],
hoverinfo = 'text',
locationmode = 'USA-states',
colorbar = {'title': "<b>Applications</b>",
'thicknessmode': "pixels",
'thickness': 70,
'lenmode': "pixels",
'len': 400,
'titlefont': {'size': 16},
'tickfont': {'size': 12},
'tickvals': [0, 20, 40, 60, 80, 100]},
colorscale = 'Blues')
#Plot cities, size corresponds to number of applications.
city_map_data = go.Scattergeo(lon = merged['Longitude'],
lat = merged['Latitude'],
text = merged['Text'],
hoverinfo = 'text',
locationmode = 'USA-states',
marker = {'size': 10*np.sqrt(merged['Count']),
'color': 'Darkgreen'})
data = [state_map_data, city_map_data]
fig = go.Figure(data = data)
fig.update_layout(title = {'text': 'Where I Have Applied (Hover for Cities and Count)',
'font': {'size': 30}},
geo_scope = 'usa',
width = 950,
height = 550)
Below is the distribution of applications by state.
#Sort by number of applications for each state.
waffle_data = state_tally[['State', 'Count']]
#Add a row for states with fewer than 3 applications. 'Other' also includes the Australia and Canada applications.
waffle_data = pd.concat([waffle_data, pd.DataFrame({'State': ['Other'], 'Count': [2]})]).reset_index(drop = True)
to_drop = []
#Add applications from states with fewer than 3 to 'Other' (the appended last row).
other_idx = waffle_data.index[-1]
for i in waffle_data.index :
    if waffle_data.iloc[i]['Count'] < 3 and i != other_idx :
        waffle_data.at[other_idx, 'Count'] += waffle_data.iloc[i]['Count']
        to_drop.append(i)
#Remove states with fewer than 3 applications. Change orientation of data.
waffle_data = waffle_data.drop(labels = to_drop, axis = 0).reset_index(drop = True).set_index('State').transpose()
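The index-based loop can also be expressed with a boolean mask: keep the big states, sum the small ones into a single 'Other' row. A sketch on a hypothetical tally:

```python
import pandas as pd

# Toy state tally; Maine and Idaho fall below the threshold of 3.
tally = pd.DataFrame({'State': ['Virginia', 'Texas', 'Maine', 'Idaho'],
                      'Count': [10, 5, 2, 1]})

small = tally['Count'] < 3
bucketed = tally[~small].copy()

# Collapse all small states into one 'Other' row at the bottom.
bucketed.loc[len(bucketed)] = ['Other', tally.loc[small, 'Count'].sum()]
print(bucketed['Count'].tolist())  # [10, 5, 3]
```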
fig = plt.figure(FigureClass = Waffle,
rows = 14,
values = waffle_data.values.tolist()[0],
labels = waffle_data.columns.tolist(),
figsize = (14, 9),
colors = bold_colors,
legend = {'loc': 'upper left',
'ncol': 2,
'fontsize': 13})